This document is an overview of an example spam detection model: how one might build a pipeline for training and evaluating a simple spam detector from basic modeling components.
The notebook is broken into several steps and outlines some of the options and caveats inherent to each step. It deliberately builds on simpler linear and neural network components, and does not suggest that the chosen architectures are the “best” overall.
Introduction
Spam detection is the process of identifying and filtering unwanted or malicious messages, such as emails, SMS, or social media posts. Spam can range from harmless but annoying advertisements to dangerous phishing attempts and malware distribution. Given the vast volume of digital communication, automated spam detection systems are essential for maintaining security, efficiency, and user experience. This article walks through the construction of a spam classifier using open source models and tooling. The article itself is written as a Jupyter notebook and rendered to HTML. The original Jupyter notebook is available here
Building an optimal spam detection model involves multiple considerations, including data collection, feature selection, model choice, and evaluation. The decision-making process typically follows these key steps:
Defining the Problem Scope
What type of spam needs to be detected (emails, SMS, social media posts, etc.)?
What is the acceptable trade-off between false positives (legitimate messages flagged as spam) and false negatives (spam messages that get through)?
Data Collection and Preprocessing
Gathering labeled datasets of spam and non-spam messages.
Cleaning the data by removing noise, tokenizing text, and handling missing values.
Augmenting data with additional signals like sender reputation, message structure, and frequency patterns.
Feature Engineering
Extracting relevant features such as word frequency, n-grams, TF-IDF scores, or embeddings from NLP models.
Incorporating metadata features (e.g., sender history, link presence, HTML content).
Model Selection
Choosing between rule-based systems, classical machine learning models (Naïve Bayes, SVMs, Random Forests), or deep learning approaches (LSTMs, Transformers).
Evaluating trade-offs between interpretability, computational cost, and effectiveness.
Training
Splitting data into training, validation, and test sets.
Evaluation
Using metrics like precision, recall, F1-score, and ROC-AUC to assess performance.
Implementing techniques like cross-validation and hyperparameter tuning to optimize the model. (A small end-to-end sketch combining several of these steps follows this list.)
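Before walking through the steps one by one, here is a minimal, self-contained sketch of a classical baseline that touches feature engineering (TF-IDF), model selection (Naïve Bayes), data splitting, and evaluation. The messages, labels, and model choice below are invented purely for illustration; the rest of this notebook uses a real dataset and a different model.

# Minimal classical baseline: TF-IDF features + Multinomial Naive Bayes.
# The four example messages below are made up purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

messages = [
    "Win a FREE prize, text WIN now!",
    "Are we still meeting for lunch tomorrow?",
    "URGENT: your account is suspended, click this link",
    "Can you send me the slides from today's talk?",
] * 25  # repeat so the split has enough rows to stratify
labels = [1, 0, 1, 0] * 25  # 1 = spam, 0 = ham

# Feature engineering: sparse TF-IDF vectors over word unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(messages)

# Data split: hold out 20% of rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

# Model selection + training: a simple Naive Bayes classifier
clf = MultinomialNB().fit(X_train, y_train)

# Evaluation: precision penalizes false positives, recall penalizes false negatives
y_pred = clf.predict(X_test)
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))

Because the toy messages repeat, the scores here are trivially perfect; the point is only to show where each step sits in the pipeline.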
Step 1: Define Scope
For the purposes of this exercise we’ll look at messages that are already labeled as spam or non-spam (“ham”).
There are a variety of techniques useful for detecting spam, including:
Creating or using instruction embeddings from larger models (a brief sketch of this idea follows below).
Pros: It is possible to “program” the embeddings with special instructions, i.e., give instructions that yield clearer separation between records based on context.
Cons: Much more expensive and time-consuming to execute.
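A rough sketch of the instruction-embedding idea, purely for illustration and not used later in the notebook: an instruction describing the classification context is prepended to each message before encoding, so the resulting vectors are shaped by that context. How much this helps depends heavily on the embedding model; purpose-built instruction-tuned embedders are normally used in practice. The model name and instruction wording below are assumptions, not recommendations.

# Illustrative only: prepend a task instruction to each message before embedding.
# Uses the same all-MiniLM-L6-v2 checkpoint that appears later in this notebook;
# dedicated instruction-tuned embedding models handle this far more effectively.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

instruction = "Represent this message for spam vs. ham classification: "
messages = [
    "Claim your free vacation now, limited time offer!",
    "Meeting moved to 3pm, see you in the usual room.",
]

plain_vectors = model.encode(messages)
instructed_vectors = model.encode([instruction + m for m in messages])

print(plain_vectors.shape, instructed_vectors.shape)  # (2, 384) each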
Evolution of Text Representation Models: Size and Capabilities

| Model Type | Typical Size | Dimensionality | Context Window | Training Data Requirements | Key Capabilities | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| TF-IDF | Very small (KB to MB) | Sparse vectors (dictionary size) | Document/corpus level | Minimal (just the target corpus) | Simple statistical word importance • Effective for document classification • Computationally efficient • No pre-training required | No semantic understanding • No word relationships • Sparse representation • Fixed vocabulary |
| Word Vectors (word2vec, GloVe) | Small (100MB-1GB) | Dense vectors (50-300) | Word level | Medium (1B+ tokens) | Captures semantic relationships • Word analogies (king - man + woman = queen) • Transfer learning for downstream tasks • Efficient inference | Static word representations • No context sensitivity • No sentence-level understanding • Word ambiguity issues |
| Sentence Embeddings (USE, InferSent, SBERT) | Medium (1-5GB) | Dense vectors (512-1024) | Sentence level | Large (10B+ tokens) | Sentence-level semantics • Better for similarity tasks • Cross-lingual capabilities • Effective for retrieval | Limited contextual understanding • Fixed-length representations • Less effective for long documents • Limited compositional abilities |
| Small Transformers (BERT-base, RoBERTa-base) | Medium (0.5-1GB) | Contextual vectors (768-1024) | Limited (512 tokens) | Very large (30B+ tokens) | Contextual word representations • Bidirectional context • Strong performance on many NLP tasks • Fine-tuning capabilities | Limited context window • Moderate parameter efficiency • Training compute requirements • Still primarily linguistic understanding |
| Large Language Models (GPT-3, PaLM, GPT-4) | — | — | — | — | — | Enormous compute requirements • Training cost • Potential for biased outputs • “Black box” behavior • Challenging to interpret |
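The “word analogies” capability listed for word vectors in the table above can be illustrated in a few lines with gensim. This is a side note, not part of the spam pipeline, and it assumes the small pre-trained GloVe vectors available through gensim’s downloader (gensim itself is not in the dependency list used later).

# Side note: classic word-vector analogy (king - man + woman ≈ queen).
# Downloads a small set of pre-trained GloVe vectors on first run.
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-50")
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))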
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Create DataFrame with model data
data = {
    'name': ['TF-IDF', 'Word2Vec', 'GloVe', 'BERT-base', 'RoBERTa', 'GPT-2', 'T5-large', 'GPT-3', 'PaLM', 'GPT-4'],
    'year': [2000, 2013, 2014, 2018, 2019, 2019, 2020, 2020, 2022, 2023],
    'size': [0.001, 0.3, 0.5, 0.4, 0.5, 1.5, 3, 175, 540, 1500],  # Size in GB
    'capability': [
        'Basic word importance/Document classification',
        'Word relationships/Word analogies',
        'Global corpus statistics/Improved semantic capture',
        'Contextual representation/Bidirectional understanding',
        'Optimized pre-training/State-of-art on benchmarks',
        'Better text generation/Zero-shot learning',
        'Text-to-text framework/Multi-task learning',
        'Few-shot learning/Complex instructions',
        'Chain-of-thought reasoning/Advanced problem solving',
        'Nuanced reasoning/Multimodal understanding'
    ]
}
df = pd.DataFrame(data)

# Sort by year for proper timeline
df = df.sort_values(by=['year', 'size'])

# Create figure and axis
plt.figure(figsize=(12, 8))
ax = plt.subplot(111)

# Plot with log scale for y-axis to handle the dramatic size differences
ax.semilogy(df['year'], df['size'], marker='o', markersize=10, linewidth=2, color='#2563eb')

# Format y-axis to show values nicely
def size_formatter(x, pos):
    if x < 1:
        return f"{x:.3f}"
    else:
        return f"{int(x) if x == int(x) else x:.1f}"

ax.yaxis.set_major_formatter(FuncFormatter(size_formatter))

# Add annotations for each model
for i, row in df.iterrows():
    # Determine annotation placement (above or below point based on position)
    if row['size'] > 10:
        y_offset = -1.2  # Place below for large models
        va = 'top'
    else:
        y_offset = 1.2   # Place above for small models
        va = 'bottom'

    # Add model name
    ax.annotate(
        f"{row['name']}",
        xy=(row['year'], row['size']),
        xytext=(0, 20 * y_offset),
        textcoords="offset points",
        ha='center', va=va,
        fontweight='bold', fontsize=9,
        bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.9)
    )

    # Add capability text in smaller font
    ax.annotate(
        f"{row['capability']}",
        xy=(row['year'], row['size']),
        xytext=(0, 45 * y_offset),
        textcoords="offset points",
        ha='center', va=va,
        fontsize=8,
        bbox=dict(boxstyle="round,pad=0.3", fc="#f0f7ff", ec="#c7dbff", alpha=0.9),
        wrap=True
    )

# Add labels and title
plt.xlabel('Year', fontsize=12)
plt.ylabel('Model Size (GB)', fontsize=12)
plt.title('Growth in NLP Model Size (2000-2023)', fontsize=14, fontweight='bold')

# Add grid for better readability (especially with log scale)
plt.grid(True, which="both", ls="-", alpha=0.2)

# Adjust the x-axis to give some padding
x_min, x_max = df['year'].min() - 1, df['year'].max() + 1
plt.xlim(x_min, x_max)

# Add a note about log scale
plt.figtext(0.5, 0.01,
            "Note: Y-axis uses logarithmic scale to visualize the exponential growth in model size",
            ha="center", fontsize=9, style='italic')

# Layout adjustment to make space for annotations
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

# Save the figure
plt.savefig('nlp_model_size_growth.png', dpi=300, bbox_inches='tight')

# Show the plot
plt.show()

# Display the data as a table
print("\nNLP Model Size and Capability Data:")
print(df[['name', 'year', 'size', 'capability']].to_string(index=False))
NLP Model Size and Capability Data:
name year size capability
TF-IDF 2000 0.001 Basic word importance/Document classification
Word2Vec 2013 0.300 Word relationships/Word analogies
GloVe 2014 0.500 Global corpus statistics/Improved semantic capture
BERT-base 2018 0.400 Contextual representation/Bidirectional understanding
RoBERTa 2019 0.500 Optimized pre-training/State-of-art on benchmarks
GPT-2 2019 1.500 Better text generation/Zero-shot learning
T5-large 2020 3.000 Text-to-text framework/Multi-task learning
GPT-3 2020 175.000 Few-shot learning/Complex instructions
PaLM 2022 540.000 Chain-of-thought reasoning/Advanced problem solving
GPT-4 2023 1500.000 Nuanced reasoning/Multimodal understanding
## If you have jupyter lab running already, just uncomment this cell to install what you need.
# !pip install polars sentence-transformers tqdm pyarrow altair ipywidgets pandas matplotlib
Step 2: Install Dependencies
Required Python Packages
The following Python packages are needed for the project: polars, sentence-transformers, tqdm, pyarrow, altair, ipywidgets, pandas, and matplotlib (see the pip install cell above).
We will call out specific libraries and their strengths and weaknesses as they come up.
Recommended: Install Miniconda
Using Miniconda is highly recommended to manage dependencies efficiently. Follow the instructions below to install Miniconda and set up an environment.
1. Install Miniconda
Mac & Linux
Run the following commands in a terminal:
# Download Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# OR for macOS (Intel)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh
# OR for macOS (Apple Silicon)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -O miniconda.sh

# Install Miniconda
bash miniconda.sh -b -p $HOME/miniconda

# Initialize Conda
$HOME/miniconda/bin/conda init
Restart your terminal for the changes to take effect.
Step 3: Data Collection
We load a publicly available labeled dataset of spam and ham messages directly from GitHub using Polars.

import polars as pl

df = pl.read_csv(
    "https://raw.githubusercontent.com/bigmlcom/python/refs/heads/master/data/spam.csv",
    separator="\t",
)
df.head()
shape: (5, 2)
| Type   | Message                          |
| str    | str                              |
| "ham"  | "Go until jurong point, crazy..… |
| "ham"  | "Ok lar... Joking wif u oni..."  |
| "spam" | "Free entry in 2 a wkly comp to… |
| "ham"  | "U dun say so early hor... U c … |
| "ham"  | "Nah I don't think he goes to u… |
Step 4: Feature Engineering
Let’s try MiniLM, a lightweight sentence embedding model.
We can create embeddings from the messages and insert them back into the dataset as a separate column. Note how Polars allows a typed, fixed-width vector (array[f32, 384]) as a column type. Pandas cannot do this, which is one of the reasons Polars is recommended here. However, Polars is still fairly new and rapidly evolving, so make sure to stay up to date with its documentation.
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
# Create the embeddings and save them efficiently as a numpy array
embeddings = sentence_model.encode(df["Message"].to_numpy())

# Bind the numpy array to the rest of the dataframe
df = df.with_columns([
    pl.Series(embeddings).alias("Message_Embeddings")
])
df.head()
shape: (5, 3)
| Type   | Message                          | Message_Embeddings                  |
| str    | str                              | array[f32, 384]                     |
| "ham"  | "Go until jurong point, crazy..… | [-0.016918, -0.038168, … -0.001258] |
| "ham"  | "Ok lar... Joking wif u oni..."  | [-0.013369, -0.04987, … -0.003396]  |
| "spam" | "Free entry in 2 a wkly comp to… | [-0.015434, 0.063041, … 0.015645]   |
| "ham"  | "U dun say so early hor... U c … | [-0.012308, 0.037198, … -0.003828]  |
| "ham"  | "Nah I don't think he goes to u… | [0.0777, -0.132872, … 0.009034]     |
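As a quick, optional sanity check (not required by the pipeline), we can compare embeddings directly: messages with related content should tend to have higher cosine similarity than unrelated ones. This reuses the embeddings array created above; the row indices follow the head() output (row 2 is the "Free entry…" spam message).

# Optional sanity check: cosine similarity between a few message embeddings.
import numpy as np

def cosine(a, b):
    # Normalized dot product between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("ham vs ham :", cosine(embeddings[1], embeddings[3]))
print("ham vs spam:", cosine(embeddings[1], embeddings[2]))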
Step 5: Model Selection
There are a number of different approaches we could take, but for the purposes of illustration we have landed on using a simple linear model on top of the sentence embeddings.
Step 6: Training
We will define a training loop. Since we are borrowing a pre-trained sentence embedder (MiniLM), we don’t need to train a new one; we only need to train the small linear classifier on top of it. Two terms that appear throughout the loop:
optimizer - the parameter update rule; in this case (as in most cases) Adam.
epoch - the number of complete passes through the training data that the training routine has completed.
# This cell contains a mix of AI and Human Generated Code.
import polars as pl
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Simple logistic regression: one linear layer followed by a sigmoid
class LogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

# Convert labels to binary (spam = 1, ham = 0) and keep as a Polars expression
df = df.with_columns((pl.col("Type") == "spam").cast(pl.Int32).alias("Label"))

# Convert Polars columns directly to PyTorch tensors
X = torch.tensor(df["Message_Embeddings"], dtype=torch.float32)   # Embeddings tensor
y = torch.tensor(df["Label"], dtype=torch.float32).unsqueeze(1)   # Labels tensor

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize model
input_dim = X.shape[1]  # Get embedding size
model = LogisticRegression(input_dim)

# Define loss and optimizer
criterion = nn.BCELoss()  # Binary cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
holdout_metrics = []
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        # Evaluate the model on the holdout
        model.eval()
        with torch.no_grad():
            y_pred = model(X_test)
            holdout_loss = criterion(y_pred, y_test)
            y_pred_labels = (y_pred > 0.5).float()  # Convert probabilities to binary (0 or 1)

            # Convert tensors to NumPy arrays for sklearn metrics
            y_test_np = y_test.numpy()
            y_pred_np = y_pred_labels.numpy()

            # Compute evaluation metrics
            accuracy = accuracy_score(y_test_np, y_pred_np)

            metrics = {
                "epoch": epoch,
                "loss": loss.item(),
                "holdout_loss": holdout_loss.item(),
                "holdout_accuracy": accuracy,
            }
            holdout_metrics.append(metrics)

metrics = pd.DataFrame(holdout_metrics).set_index("epoch")
metrics
           loss  holdout_loss  holdout_accuracy
epoch
0      0.666757      0.643077          0.878788
10     0.472433      0.468307          0.878788
20     0.363610      0.375695          0.878788
30     0.300856      0.325027          0.878788
40     0.258253      0.291411          0.886364
50     0.226611      0.266079          0.901515
60     0.202605      0.246158          0.924242
70     0.183906      0.230198          0.939394
80     0.168782      0.217127          0.939394
90     0.156178      0.206159          0.939394
Step 7: Evaluation
We come to the all-important evaluation step. We need to assess the results and determine whether the model is well optimized.
We need to compare the loss metrics. What does loss mean, exactly? Do the loss metrics make sense? (Are they going down? Is one loss better than the other? Why?)
The holdout accuracy starts around 88% and climbs to roughly 94%, which seems good. Is it good enough? What are some simple things we can do to make it better just by looking at the chart? (A short sketch computing additional holdout metrics follows the plot below.)
metrics.plot(title="Holdout Training Metrics by Epoch");
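The training cell above imports precision_score, recall_score, and f1_score but only reports accuracy. Here is a minimal sketch of computing those metrics (plus ROC-AUC) on the final holdout predictions, reusing the y_test_np, y_pred_np, and y_pred variables left over from the last evaluation pass of the training loop.

# Fuller picture of holdout performance than accuracy alone.
# Reuses y_test_np, y_pred_np (hard 0/1 labels) and y_pred (probabilities)
# from the training cell above.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

print("precision:", precision_score(y_test_np, y_pred_np))    # how many flagged messages were really spam
print("recall:   ", recall_score(y_test_np, y_pred_np))        # how much spam was caught
print("f1:       ", f1_score(y_test_np, y_pred_np))
print("roc_auc:  ", roc_auc_score(y_test_np, y_pred.numpy()))  # threshold-independent ranking quality

Precision and recall also make the false-positive/false-negative trade-off from Step 1 measurable: raising the 0.5 decision threshold trades recall for precision, and vice versa.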
Future Steps
At this point, we must decide whether to release the model as-is or look for ways to improve its performance.
What are some pieces of information that could be added to the training embedding/vector? How?
What do we need to do to maintain this model? How would we detect model drift?
If we need to change from a binary classification model to a multi-class model, what needs to change? What metrics should be used? (A minimal sketch of the multi-class change follows.)
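As a partial answer to that last question, and only as a sketch: moving to multi-class mostly means widening the output layer and switching the loss, after which metrics such as macro-averaged F1 or a per-class confusion matrix replace the binary ones. The number of classes and the labels below are placeholders for illustration.

# Sketch: the binary classifier head generalized to K classes.
# num_classes and the integer class labels are hypothetical placeholders.
import torch
import torch.nn as nn

class MultiClassHead(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(input_dim, num_classes)  # K logits instead of 1

    def forward(self, x):
        return self.linear(x)  # raw logits; no sigmoid

num_classes = 3  # e.g., ham / spam / phishing (hypothetical)
model = MultiClassHead(input_dim=384, num_classes=num_classes)
criterion = nn.CrossEntropyLoss()  # expects integer class labels, not 0/1 floats

# Dummy batch: 8 embedding-sized vectors and 8 integer labels in [0, num_classes)
x = torch.randn(8, 384)
y = torch.randint(0, num_classes, (8,))
loss = criterion(model(x), y)
print(loss.item())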